-
Notifications
You must be signed in to change notification settings - Fork 851
Assert if operations on keep_alive_queue and active_queue are not thread-safe #4379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mingzym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, hope this will help us finish the long standing mysterious issues in the plugin & transaction system
|
We were also seeing some crashes with the active_queue_link as well. Resulting in stack traces like. And digging in, we saw that the nh from another thread was showing up on the stack (in frame 3) in the stack above. My current theory is that the problem was occurring from reusing a stale client_vc. And I thought we fixed it by removing release_netvc ("Fix crash in Http2ClientSession::release_netvc" on master). As we have deployed this fix, we are no longer seeing the active_queue crash. The TSHttpSMCallback is schedule_imm from TSHttpTxnReenable which should schedule the TSHttpSMCallback on the same thread that called the Reenable. I guess for many plugins this would be a Task thread. Is that your scenarios? If so, your fix seems reasonable, but I'm a bit leery of the blocking locks. @duke8253 is working on a thread affinity. I think we already have sufficient information in this case that we can just schedule on the thread associated with the HttpSM. Given a HttpSM sm, we should be able to call sm->ua_txn->adjust_thread(cont, event, data) |
|
@shinrich Yes, I found the internal queue of NetHandler is accessed from another thread without acquire the NetHandler's lock. If you have a bit leery of the blocking locks, I can just leave ink_release_assert check. |
|
[approve ci autest] |
|
@oknet indicated that we can not take nh lock directly since it might cause dead lock.(nh first take their own lock and then acquired the vc lock, now vc is possible to access nh lock under vc's lock). |
|
The autest always failed on H2 test after I just leave ink_release_assert here. I think there are some bugs related with H2 codes. @masaori335 ccid_ctid, double_h2 and http2_priority failed: The |
|
openclose failed: I will create another PR to fix it. |
…ead-safe The `vc->add_to_keep_alive_queue()` is called by: - Http1ClientSession::release() - Http2ClientSession::cleanup_streams() - Http2ClientSession::release_stream() And the `NetHandler::remove_from_active_queue()` is called by `vc->add_to_keep_alive_queue()`. The NetHandler::keep_alive_queue and NetHandler::active_queue are internal queue of NetHandler. We must have acquired the NetHandler's lock before doing anything with them. When reenable a HttpSM from plugin, the HttpSM would be run into ethread that different from client_vc lives because TSHttpSMCallback is scheduled by `eventProcessor.schedule_imm()`.
|
My view would be to fix it in |
|
Running this on docs.trafficserver. |
|
For the asserts, it would be interesting to see if the lock wasn't being held. Or if it is being held by another thread. For the autest stack traces I wouldn't be surprised if the nh mutex just wasn't held. |
|
@SolidWallOfCode @shinrich @zwoop @scw00 We should choose a rule to operating the keep_alive_queue and active_queue:
|
|
I think we should fix TSHttpTxnReenable to schedule to the correct ET_NET thread (satisfying option 1) rather than some thread in the ET_NET pool. It should still hold the lock. |
|
@zwoop any assert reported on docs? |
|
No asserts on Docs :/ Do we want this PR into 7? If so, can you please make a "git cherry-pick -x" backport into the 7.1.x branch? Update: of course, don't make a PR against 7.1.x until this has landed on master! |
scw00
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to approve this pr event if it may cause crash.
|
This is cherry-picked into 7.1.6 via #4787 |
… not thread-safe" This reverts commit 3599e5e. Seems this causes new, unpredcitable crashes. For now, lets revert this, and we can keep working on this for 8.x / 9.x. Reference apache#4379 and apache#4921 and apache#5120.
|
Cherry picked to 8.1.0 |
The
vc->add_to_keep_alive_queue()is called by:And the
NetHandler::remove_from_active_queue()is called byvc->add_to_keep_alive_queue().The NetHandler::keep_alive_queue and NetHandler::active_queue are
internal queue of NetHandler. We must have acquired the NetHandler's
lock before doing anything with them.
When reenable a HttpSM from plugin, the HttpSM would be run into ethread
that different from client_vc lives because TSHttpSMCallback is
scheduled by
eventProcessor.schedule_imm().